Abstract

Modularizing a large optimization problem so that 
the solutions to the subproblems provide a good 
overall solution is a challenging problem. In this 
paper we present a multi-agent approach to this 
problem based on aligning the agent objectives 
with the system objectives, obviating the need to 
impose external mechanisms to achieve collabora- 
tion among the agents. This approach naturally ad- 
dresses scaling and robustness issues by ensuring 
that the agents do not rely on the reliable operation 
of other agents. 

We test this approach in the difficult distributed
optimization problem of imperfect device subset
selection [Challet and Johnson, 2002]. In this problem,
there are n devices, each of which has a “distor- 
tion”, and the task is to find the subset of those n 
devices that minimizes the average distortion. Our 
results show that in large systems (1000 agents) the 
proposed approach provides improvements of over 
an order of magnitude over both traditional opti- 
mization methods and traditional multi-agent meth- 
ods. Furthermore, the results show that even in ex- 
treme cases of agent failures (i.e., half the agents 
fail midway through the simulation) the system
remains coordinated and still outperforms a failure-free
and centralized optimization algorithm.


1 Introduction 

Modularizing a large optimization problem so that the solu- 
tions to the subproblems provide a good overall solution is 
a challenging problem. Similarly, coordinating a large num- 
ber of agents to achieve complex tasks collectively presents 
new challenges to the field of multi-agent systems. In this 
work we leverage recent advances in the field of multi-agent 
coordination to modularize and solve a difficult optimization 
problem. 

In truly large problems, many of the reasonable assump- 
tions used in multi-agent coordination problems related to a 
handful of agents are difficult to justify. When dealing with a 
small number of agents it is reasonable to assume that agents 
react to one another, can model one another, and/or enter 


into contracts with one another [Clement and Durfee, 1999; 
Decker and Lesser, 1995; Hu and Wellman, 1998; Sandholm
and Lesser, 1997]. When dealing with thousands of agents
on the other hand, such assumptions become more difficult 
to maintain. At best, one can assume that the agents are
aware of other agents as part of a background. In such cases,
agents have to act within an environment that may be shaped
by the actions of other agents, but cannot be interpreted as
the by-product of the actions of any single agent.

In this work, we focus on an agent coordination method 
that aims to handle large systems composed of simple agents 
and where those agents are failure-prone. The size of the
system (e.g., a thousand agents) naturally requires methods that
do not rely on detailed information about the actions of all 
the agents being available to a given agent. The simplicity 
of the agents requires methods where the interactions among
the agents provide the system's strength rather than the
sophistication of each individual agent. Finally, the propensity
of the agents to sudden failures requires solutions where one 
agent’s actions are not rigidly dependent on the actions of 
other agents. 

The problem of imperfect device subset selection intro- 
duced by Challet and Johnson [Challet and Johnson, 2002] 
provides the perfect setting to test the efficacy of the multi- 
agent method. In this problem, there are n objects, each 
of which has a “distortion”. The task is to find the sub- 
set of those n objects that minimizes the average distor- 
tion. This is a hard optimization problem, and brute force 
approaches cannot be used for any but its smallest toy in- 
stances [Challet and Johnson, 2002; Garey and Johnson, 
1979]. We propose to address this problem by associating 
each device with a simple adaptive Reinforcement-Learning 
(RL) agent [Kearns and Koller, 1999; Littman, 1994; Sutton
and Barto, 1998] that decides whether or not its device
will be a member of the subset. The essential problem
is to determine how best to set the agent utility functions
(e.g., subsystem objective functions) in a way that will
lead to good values of the global utility (e.g., global objective,
in this case average distortion), without involving
difficult-to-scale external mechanisms to ensure cooperation among
the agents. This problem has been explored in many different
domains, including multi-agent systems, distributed
optimization, computational economics, mechanism design, 
computational ecologies and game theory [Boutilier, 1996; 


Sandholm and Lesser, 1997; Huberman and Hogg, 1988; 
Parkes, 2001; Stone and Veloso, 2000]. (See [Turner and 
Wolpert, 2004] for a survey of the different approaches taken 
in each field to this problem.) 

This paper presents an agent coordination method
well-suited for large and noisy multi-agent systems, and tests it
on a difficult distributed optimization problem. In Section 2
we focus on the properties the agent utilities must possess 
to lead to coordination. In Section 3 we present the imper- 
fect device combination problem and derive the specific agent 
utilities for this domain. In Section 4 we present results show- 
ing that the proposed method outperforms traditional methods 
by up to an order of magnitude, has superior scaling proper- 
ties and is resistant to severe cases of agent failure. Finally, in 
Section 5 we provide a summary and discuss the implications 
and general applicability of this work. 


2 Background 

For the joint action of agents working in a large system to pro- 
vide good values of the global utility, we must ensure that: (i) 
the agents’ goals support the global goal; and (ii) each agent
has a solvable problem. The first of these two desirable
properties is that the agent utilities have to be aligned with the
global utility. For discrete states, this can be formalized by 
assessing the “degree of factoredness” between any two util- 
ities [Agogino and Turner, 2004], which gives the fraction of 
actions for an agent where the agent utility and global utility
have the same delta (e.g., if the agent utility goes up so does
the global utility, and vice versa).
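
As a rough illustration of this definition, the following Python sketch
estimates the degree of factoredness for one agent by sampling random joint
states and checking whether a change in that agent's action moves the agent
utility and the global utility in the same direction. The function names, the
sampling scheme, the restriction to binary actions, and the toy utility
`toy_g` are all assumptions made for illustration; they are not taken from
[Agogino and Turner, 2004].

```python
import random


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def estimate_factoredness(agent_utility, global_utility, i, n_agents,
                          samples=2000, seed=0):
    """Monte Carlo estimate of the degree of factoredness for agent i.

    Assumption: actions are binary, so "changing the action" of agent i
    means flipping bit i of the joint state. The result is the fraction of
    sampled flips for which both utilities change in the same direction.
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(samples):
        z = [rng.randint(0, 1) for _ in range(n_agents)]
        z_flip = list(z)
        z_flip[i] = 1 - z_flip[i]
        dg = global_utility(z_flip) - global_utility(z)
        du = agent_utility(z_flip) - agent_utility(z)
        if sign(dg) == sign(du):
            agree += 1
    return agree / samples


def toy_g(z):
    """Hypothetical global utility: prefers exactly three selected devices."""
    return -abs(sum(z) - 3)


# An agent utility equal to the global utility is fully factored.
print(estimate_factoredness(toy_g, toy_g, i=0, n_agents=10))
```

An agent utility set to the negative of `toy_g` gives an estimate of zero
under the same sampling, which is the opposite extreme.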

The second property ensures that the agent utilities have
high “learnability” for the agent. Intuitively, learnability
measures the sensitivity of an agent’s utility to its own actions.
It is computed by dividing the expected value of changes in
agent i’s utility caused by changes in agent i’s actions by the
expected value of changes in agent i’s utility caused by changes
in the actions of agents other than i [Wolpert and Turner,
2001; Agogino and Turner, 2004]. So at a given state, the
higher the learnability, the more the utility of agent i depends
on the move of agent i, i.e., the better the associated
signal-to-noise ratio for i. Higher learnability is desirable because it
makes it easier for i to achieve large values of its utility.

Typically these two requirements are in conflict with one 
another. As a trivial example, a system in which all the agent 
utility functions are set to the global utility is fully factored. 
However, such a system will have low learnability since each 
agent’s utility will depend on the actions of all the other 
agents in the system. It will be nearly impossible for the 
agents to determine the best actions to follow in most non- 
trivial systems. At the other extreme, providing each agent 
with a simple, local utility function will result in high learn- 
ability for the agents’ utilities, but will not necessarily lead 
the system to high values of global utility, unless the degree 
of factoredness is also high. The challenge, then, is
to find agent utilities in a given domain with the best
trade-off between these two requirements.



3 Combination of Imperfect Devices 

We now present the difficult optimization problem of combin- 
ing imperfect devices [Challet and Johnson, 2002]. A typical 
example of this problem arises when many simple and noisy 
observational devices (e.g., nano or micro devices, low power 
sensing devices) attempt to accurately determine some value 
pertinent to the phenomenon they’re observing. Each device 
will provide a single number that is slightly off, similar to 
sampling a Gaussian centered on the value of the real num- 
ber. The problem is to choose the subset of a fixed collection 
of such devices so that the average (over the members of the 
subset) distortion is as close to zero as possible. 


3.1 Problem Definition 

Formally, the problem is to minimize

$$\epsilon = \frac{\left| \sum_{j=1}^{N} n_j a_j \right|}{\sum_{j=1}^{N} n_j} \qquad (1)$$

where $n_j \in \{0, 1\}$ indicates whether device j is or is not selected,
and there are N devices in the collection, having associated
distortions $\{a_j\}$. This is a hard optimization problem
that is similar to known NP-complete problems such
as subset sum or partitioning [Challet and Johnson, 2002;
Garey and Johnson, 1979], but has two twists: the presence
of the denominator and that $a_j \in \mathbb{R} \; \forall j$. In this work we set
the system-level, global utility function to $G = -\epsilon$ (we do this
so that the goal is to “maximize” G, which is more consistent
with the concept of a “utility” function). G is a function of the
full system state z (e.g., joint moves of all the agents).
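
A minimal Python sketch of this objective may help fix the notation. It
computes the average distortion of a selected subset and the corresponding
global utility G. The function names and the handling of the empty subset
(which the problem statement does not define) are assumptions made for
illustration.

```python
def distortion(n, a):
    """Average distortion of the selected subset: |sum_j n_j a_j| / sum_j n_j.

    n: list of 0/1 selections, one per device.
    a: list of real-valued device distortions, same length as n.
    """
    count = sum(n)
    if count == 0:
        # Assumption: the empty subset is undefined in the source;
        # it is treated here as the worst possible outcome.
        return float("inf")
    return abs(sum(nj * aj for nj, aj in zip(n, a))) / count


def global_utility(n, a):
    """Global utility G, the negative of the distortion, to be maximized."""
    return -distortion(n, a)


# Example with three hypothetical devices, the first two selected.
a = [0.5, -0.3, 0.1]
print(distortion([1, 1, 0], a))      # |0.5 - 0.3| / 2
print(global_utility([1, 1, 0], a))
```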

The system is composed of N agents, each responsible for
setting one of the $n_j$. Each of those agents has its own utility
function that it is trying to maximize, though the overall
objective is to maximize global performance. Our goal is to
devise agent utility functions that will cause the multi-agent 
system to produce high values of G(z) [Agogino and Turner, 
2004; Wolpert and Turner, 2001]. 


3.2 Expected Difference Utility 

Now let us present the first of two utilities that possess the 
desirable properties discussed in Section 2, the Expected
Difference Utility (EDU). EDU aims to isolate the impact 
of an agent on the full system by focusing on the difference 
between the actual impact the agent has and its “expected” 
impact. Let $E_{z_i}[G(z) \mid z_{-i}]$ provide the expected value of G
over the possible actions of agent i, where $z_i$ denotes the state
of agent i and $z_{-i}$ denotes the states of all agents other than i.
Then EDU for this application becomes:

$$EDU_i(z) = G(z) - E_{z_i}[G(z) \mid z_{-i}] = G(z) - p(n_i = 1)\, G(z_{-i}, n_i = 1) - p(n_i = 0)\, G(z_{-i}, n_i = 0)$$

where $p(n_i = 1)$ and $p(n_i = 0)$ give the probabilities that
agent i set its $n_i$ to 1 or 0 respectively. In what follows, we
will assume that those two actions are equally likely (i.e., for
all agents i, $p(n_i = 1) = p(n_i = 0) = 0.5$).

For each agent, EDU is fully factored with G because the 
second term does not depend on agent i’s state [Wolpert and
Turner, 2001] (these utilities are referred to as AU in [Wolpert 
and Turner, 2001]). Furthermore, because it removes noise 
from an agent’s utility, EDU yields far better learnability than 
does G [Wolpert and Turner, 2001]. This noise reduction is 
due to the subtraction which (to a first approximation) elimi- 
nates the impact of states that are not affected by the actions 
of agent i. 

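Depending on which action agent i chose (0 or 1), EDU
can be reduced to:

$$EDU_i(z) = 0.5 \left( \frac{\left| \sum_j n_j a_j - a_i \right|}{\sum_j n_j - 1} - \frac{\left| \sum_j n_j a_j \right|}{\sum_j n_j} \right) \quad \text{if } n_i = 1 ,$$

or:

$$EDU_i(z) = 0.5 \left( \frac{\left| \sum_j n_j a_j + a_i \right|}{\sum_j n_j + 1} - \frac{\left| \sum_j n_j a_j \right|}{\sum_j n_j} \right) \quad \text{if } n_i = 0 .$$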


Note that in this formulation, EDU provides a very clear 
signal. If EDU is positive, the action taken by agent i was 
beneficial to G, and if EDU is negative, the action was detri- 
mental to G. Thus an agent trying to maximize EDU will 
efficiently maximize G, without explicitly trying to do so. 
Furthermore, note that the computation of EDU requires very 
little information. Any system capable of broadcasting G can 
be minimally modified to accommodate EDU. For each agent 
to compute its EDU, the system needs to broadcast the two 
numbers needed to compute G: the number of devices that 
were turned on (i.e., the denominator in Equation 1) and the 
associated subset distortion as a real number (i.e., the
numerator in Equation 1 before the absolute value operation is
performed). Based on those two numbers, the agent can compute
its EDU.


3.3 Wonderful Life Utility 

The second utility we present is the Wonderful Life Utility 
(WLU) which aims to isolate the impact of an agent on the 
full system by focusing on the difference between the impact 
of the agent and its “disappearance” from the system [Wolpert 
and Turner, 2001]. WLU for this application becomes: 

$$WLU_i(z) = G(z) - G(z_{-i}) \qquad (4)$$



The major difference between EDU and WLU is in how
they handle the noise-removing second term. EDU provides
an estimate of agent i’s impact by sampling all possible
actions of agent i, whereas WLU simply removes agent i from
the system. WLU is also factored with G, because the second
term does not depend on the actions of agent i (i.e., both WLU
and G have the same derivative with respect to $z_i$, the state of
agent i) [Wolpert and Turner, 2001]. Note however, that unlike
with EDU, the action chosen by agent i has a large impact on
the efficiency of WLU. If agent i chooses action 0, the two
terms in Equation 4 are identical, resulting in a WLU of zero.


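Depending on which action agent i chose (0 or 1), WLU can
be reduced to:

$$WLU_i(z) = \frac{\left| \sum_j n_j a_j - a_i \right|}{\sum_j n_j - 1} - \frac{\left| \sum_j n_j a_j \right|}{\sum_j n_j} \quad \text{if } n_i = 1 \qquad (5)$$

or:

$$WLU_i(z) = 0 \quad \text{if } n_i = 0 . \qquad (6)$$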

In this formulation, unlike EDU, WLU provides a clear sig- 
nal only if agent i had chosen action 1 . In that case, a positive 
WLU means that the action was beneficial to G, and a nega- 
tive WLU means that the action was detrimental for G. How- 
ever, if agent i had chosen action 0, it receives a reward of 0 
regardless of whether that action was good or bad for G. This 
means that on average half the actions an agent takes will be 
random as far as G is concerned. Considering learnability
implications, this means that on average WLU will have half the
learnability of EDU for this problem.
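
For comparison, a Python sketch of WLU under the same broadcast assumptions
(signed sum and count of selected devices) is given below. The function name,
the argument names, and the handling of the case where agent i is the only
selected device are assumptions made for illustration.

```python
def wlu(n_i, a_i, numerator, count):
    """WLU for agent i: G(z) minus G with agent i removed from the system.

    n_i:       action taken by agent i (0 or 1).
    a_i:       distortion of agent i's device.
    numerator: signed sum of n_j * a_j (before the absolute value).
    count:     number of selected devices (sum of n_j).
    """
    if n_i == 0:
        # Removing an unselected device changes nothing: both terms coincide.
        return 0.0
    if count == 1:
        # Assumption: removing the only selected device leaves an empty
        # subset, treated here as the worst outcome.
        return float("inf")
    g_actual = -abs(numerator) / count
    g_without = -abs(numerator - a_i) / (count - 1)
    return g_actual - g_without


# Hypothetical broadcast values: two devices selected, signed sum 0.2.
print(wlu(n_i=1, a_i=0.5, numerator=0.2, count=2))
print(wlu(n_i=0, a_i=0.1, numerator=0.2, count=2))
```

The second call returns zero regardless of the device distortion, which is
the uninformative case discussed above.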


4 Experimental Results 

In this work we purposefully used computationally unsophis- 
ticated and easy to build agents for the following reasons: 

1. To ensure that we remained consistent with our pur- 
pose of showing that a large scale system of potentially 
failure-prone agents can be coordinated to achieve a 
global goal. Indeed, building thousands of sophisticated 
agents may be prohibitively difficult; therefore though 
systems that will scale up to thousands may use sophis- 
ticated agents, they cannot rely on such sophistication. 

2. To focus on the design of the utility functions. Having
sophisticated agents can obscure the differences in per- 
formance due to the agent utility functions and the algo- 
rithms they ran. By having each agent run a very simple 
algorithm we kept the emphasis on the effectiveness of 
the utility functions. 

Each agent had a data set and a simple reinforcement
learning algorithm. Each agent’s data set contained
{time, action, utility value} triplets that the agent stored
throughout the simulation. At each time step each agent chose
what action to take, which provided a joint action which in
turn set the system state. Based on that state, the global utility
and the agent utilities for all agents were computed. The
new {time, action, utility value} triplet for agent i was then
added to the data set maintained by agent i. This was done for
all agents, and then the process repeated.

To choose its actions, an agent uses its data set to estimate 
the values of the utility it would receive for taking each of 
its two possible moves. Each agent i picks its action at a time
step based on the utility estimates it has for each possible ac- 
tion. Instead of simply picking the largest estimate, to pro- 
mote exploration it probabilistically selects an action, with 
a higher likelihood of selecting the actions with higher util- 
ity estimates, e.g., it uses a Boltzmann distribution across the 
utility values [Sutton and Barto, 1998]. Because the experi- 
ments were run for short periods of time, the temperature in 
the Boltzmann distribution did not decay in time. Finally, to 
form the agents’ initial data sets, there is an initialization
period in which all actions by all agents are chosen uniformly
at random, with no learning used. It is after this initialization
period ends that the agents choose their actions according to
the associated Boltzmann distributions.
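
The action selection step can be sketched in Python as follows. The text does
not specify how the utility estimates are formed from the data set, so the
per-action average used in `estimate_utilities` is an assumption, as are all
function names. Only the use of a Boltzmann distribution over the estimates
with a fixed temperature comes from the description above.

```python
import math
import random


def estimate_utilities(data_set):
    """Estimate the utility of each action from stored triplets.

    Assumption: the estimate is the plain average of the stored utility
    values for that action; an untried action gets an estimate of 0.
    """
    totals = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for _time, action, utility in data_set:
        totals[action] += utility
        counts[action] += 1
    return [totals[k] / counts[k] if counts[k] else 0.0 for k in (0, 1)]


def boltzmann_probabilities(estimates, temperature):
    """Boltzmann (softmax) probabilities over the utility estimates."""
    top = max(estimates)  # subtract the maximum for numerical stability
    weights = [math.exp((u - top) / temperature) for u in estimates]
    total = sum(weights)
    return [w / total for w in weights]


def choose_action(data_set, temperature, rng=random):
    """Sample action 0 or 1 according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(estimate_utilities(data_set), temperature)
    return 0 if rng.random() < probs[0] else 1


# Hypothetical data set of (time, action, utility value) triplets.
data = [(0, 0, -0.4), (1, 1, -0.1), (2, 1, -0.3)]
print(boltzmann_probabilities(estimate_utilities(data), temperature=0.1))
print(choose_action(data, temperature=0.1, rng=random.Random(0)))
```

A lower temperature concentrates the probability on the action with the
higher estimate, while a higher temperature makes the choice closer to
uniform.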

For all learning algorithms, the first 20 time steps consti- 
tute the data set initialization period (note that all learning 
algorithms must “perform” the same during that period, since 
none are actually in use then). Starting at t = 20, with each 
consecutive time step a fixed fraction of the agents switch 
to using their learner algorithms instead, while others con- 
tinue to take random actions. Because the behavior of the 
agents starting to use their learning algorithm changes, hav- 
ing all agents start learning simultaneously provides a sudden 
“spike” into the system which significantly slows down the 
learning process. This gradual introduction of the learning 
algorithms is intended to soften the “discontinuity” in each 
agent’s environment. In these experiments, for N = 50 and
N = 100, three agents turned on their learning algorithms at
each time step, and for N = 1000, sixty agents turned on their
learning algorithms at each time step.
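
The gradual introduction of the learners can be sketched as a simple schedule
in Python. The function name is hypothetical, and the sketch assumes that the
first batch of agents switches over at t = 20 itself; the text does not state
whether the first switch happens at that step or the next one.

```python
def learning_agents(t, n_agents, per_step, init_steps=20):
    """Number of agents using their learning algorithm at time step t.

    Assumption: the first batch of `per_step` agents switches at
    t == init_steps, and one more batch switches at every later step
    until all agents are learning.
    """
    if t < init_steps:
        return 0
    return min(n_agents, per_step * (t - init_steps + 1))


for t in (19, 20, 30, 53):
    print(t, learning_agents(t, n_agents=100, per_step=3))
```

Under this assumed schedule, all 100 agents are learning by t = 53 with three
agents per step, and all 1000 agents by t = 36 with sixty agents per step.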

4.1 Agent Utility Performance 

Figures 1-2 show the convergence properties of different 
agent utilities and a search algorithm in systems with 100 
and 1000 agents respectively. The results reported are based 
on 20 different $\{a_j\}$ configurations, where each $a_j$ is selected
from a Gaussian distribution with zero mean and unit
variance. For each configuration, the experiments were run 
50 times (i.e., each point in the graphs is the average of 
20 x 50 = 1000 runs). In all cases, the Boltzmann parame- 
ter (e.g., temperature x) was set to 0.1. The graphs labeled G, 
EDU and WLU show the performance of agents using rein- 
forcement learners with those reinforcement signals provided 
by G, EDU and WLU respectively. S shows the performance 
of local search where new $n_i$'s are generated at each step by
perturbing the current state and selected if the solution is bet- 
ter than the current best solution (in the experiments reported 
here, 25% of the actions were randomly changed at each time 
step, though somewhat surprisingly, the results are not partic- 
ularly sensitive to this parameter). Because the runs are only 
200 time steps long, algorithms such as simulated annealing 
do not outperform local search: there is simply no time for an 
annealing schedule. This local search algorithm provides the 
performance of an algorithm with centralized control. 
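
A minimal Python sketch of such a centralized local search is given below. It
perturbs 25% of the selections at each of 200 steps, as in the text, and keeps
a candidate only if it improves on the best solution so far. Perturbing the
best state found so far, implementing a "change" as a bit flip, the function
name, and the handling of the empty subset are assumptions made for
illustration.

```python
import random


def local_search(a, steps=200, flip_fraction=0.25, seed=0):
    """Centralized local search over subsets of devices.

    a: list of device distortions.
    Returns the best selection found and its average distortion.
    """
    rng = random.Random(seed)
    n_agents = len(a)

    def distortion(n):
        count = sum(n)
        if count == 0:
            # Assumption: an empty subset is treated as the worst outcome.
            return float("inf")
        return abs(sum(nj * aj for nj, aj in zip(n, a))) / count

    best = [rng.randint(0, 1) for _ in range(n_agents)]
    best_e = distortion(best)
    n_flips = max(1, int(flip_fraction * n_agents))
    for _ in range(steps):
        candidate = list(best)
        # Flip a random 25% of the selections (assumed form of perturbation).
        for j in rng.sample(range(n_agents), n_flips):
            candidate[j] = 1 - candidate[j]
        e = distortion(candidate)
        if e < best_e:
            best, best_e = candidate, e
    return best, best_e


# Hypothetical instance: 100 devices with unit-variance Gaussian distortions.
gen = random.Random(1)
devices = [gen.gauss(0.0, 1.0) for _ in range(100)]
selection, e = local_search(devices)
print(sum(selection), e)
```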

In both cases in which agents use the G utility, they have 
a difficult time learning. The noise in the system is too large 
for such agents to learn how to select their actions. For 100 
agents (Figure 1), WLU performs at the level of the central- 
ized algorithm. Because agents only receive useful feedback 
when they take one of the two actions, the noise in the sys- 
tem is larger than that for EDU. This “noise” becomes too 
much for systems with 1000 agents (Figure 2), where WLU 
is outperformed by the centralized algorithm. EDU, on the 
other hand, continues to provide a clean signal for all systems 
up to the largest we tested (1000 agents). Note that because 
agents turning on their learning algorithm changes the
environment, the performance of the system as a whole degrades
immediately after learning starts (i.e., after 20 steps) in some 
cases. Once agents adjust to the new environment, the system 
settles down and starts to converge. 



Figure 1: Performance of the three utility functions for
N=100.



Figure 2: Performance of the three utility functions for 
N=1000. 

4.2 Scaling Characteristics of Utilities 

Figure 3 shows scaling results (the t = 200 average perfor- 
mance over 1000 runs) along with the associated error bars 
(differences in the mean). As N grows two competing factors 
come into play. On the one hand, there are more degrees of
freedom to use to maximize G. On the other hand, the problem
becomes more difficult: the search space gets larger for
S, and there is more noise in the system for the learning
algorithms. To account for these effects and calibrate the
performance values as N varies, we also provide the baseline
performance of the “algorithm” that randomly selects its action
(“Ran”). Note that the difference between the performances 
of all algorithms and EDU increases when the system size 
increases, reaching a factor of twenty for S and over 600 for 
G for N = 1000. 

Also note that all algorithms but EDU have slopes simi- 
lar to that of “Ran”, showing that they cannot use the addi- 
tional degrees of freedom provided by the larger N. Only 
EDU effectively uses the new degrees of freedom, providing
gains that are proportionally higher than the other algorithms
(i.e., the rate at which EDU's performance improves outpaces
what is “expected” based on Ran's performance).

4.3 Robustness 

In order to evaluate the robustness of the proposed utility 
functions for multi-agent coordination, we tested the
performance of the system when a subset of the agents failed
during the simulation. At a given time (t ~ 100 in these
experiments), a certain percentage of agents failed (e.g., were
turned off), simulating hazardous conditions in which the
functioning of the agents cannot be ascertained. The relevance of
this experiment is in determining whether the proposed utility 
functions require all or a large portion of the agents to per- 
form well to be effective, or whether they can handle sudden 
changes to their environment. 

Figure 4 shows the performance of EDU, WLU, and G for 
100 agents when 20% of the agents fail at time step t ~ 100. 
The results of the centralized search algorithm with no
failures (“S” from Section 4.1) are also included for comparison.

In these experiments, none of the agent learning algorithms 
were adjusted to account for the change in the environment. 

In agents that continued to function, the learning proceeded 
as though nothing had happened. As a consequence, not only 
did the agents need to overcome the sudden change in their 
task but they had to do so with parameters tuned to the pre- 
vious environment. Despite these limitations, EDU recovers 
rapidly for the 100 agent case, whereas G and WLU do not. 
Note that this is a powerful result: a distributed algorithm with
only 80% functioning agents tuned to a different environment 
outperforms a 100% functioning centralized algorithm. 

Figure 5 shows the performance of EDU when the percentage
of agent failures increases from 10 to 50% for 100
agents. For comparison purposes, the search results (from
Section 4.1) are also included. After the initial drop in
performance when the agents stop responding, EDU-trained
algorithms recover rapidly and, even with half the agents,
outperform the fully functioning and centralized search algorithm.

Figure 5: Effect of agent failures on EDU for 100 agents (S
has no failures).

These results demonstrate both the adaptability of the EDU 
and its robustness to failures of individual agents, even in ex- 
treme cases. 

5 Discussion 

The combination of imperfect devices is a simple abstraction 
of a problem that will loom large in the near future: how
to coordinate a very large number of agents, each with limited
sophistication and prone to failure, to achieve a prespecified
global objective. This problem is fundamentally different
from traditional multi-agent coordination (and distributed
optimization) problems in at least three ways: (i) the agents are
simple and do not model the actions of other agents; (ii) the 
agents are unreliable and failure-prone; and (iii) the number 
of agents is in the thousands. 

The work summarized in this paper is based on ensuring 
coordination while eliminating external mechanisms such as 
contracts and incentives to allow the systems to scale. In the
experimental domain of selecting a subset of imperfect
devices, the results show the promise of this method by
providing performance improvements of twentyfold over a
centralized algorithm and of nearly three orders of magnitude over a
multi-agent system using the global utility (G) directly.
Furthermore, when as many as half the agents fail during
simulations, the proposed method still outperforms a fully
functioning centralized search algorithm.

This approach is well-suited for addressing coordination in 
large scale cooperative multi-agent systems where the agents 
do not have pre-set and possibly conflicting goals, or when 
the agents do not need to hide their objectives. The focus is 
on ensuring that the agents do not inadvertently frustrate
one another in achieving their goals. The results show that in 
such large scale, failure-prone systems, this method performs 
well precisely because it does not rely on the agents building 
an accurate model of their surroundings, modeling the actions
of other agents, or requiring all agents to reach
a minimum performance level.